Functional Enrichment

We will study whether genes belonging to specific annotated gene sets, such as Gene Ontology (GO) terms, are enriched in the set of genes of interest, in this case the list of differentially expressed genes (DEGs). This analysis is commonly referred to as Over-Representation Analysis (ORA).

To conduct such an analysis, we need:

Loading necessary libraries

pacman::p_load(tidyverse, DESeq2, gprofiler2)

Loading the Data

# The list of DEG results
res<-readRDS("./results/deseq2_regions_results_local.rds")

Loading the Annotations

This annotation file contains all P. cultripes transcripts by rows. As columns, we can find: * The gene IDs for P. cultripes, followed by the transcripts IDs and the peptides IDs (gene_id, transcript_id, peptide_id) * The IDs and descriptions of X. tropicalis annotated proteome resulting from both nucleotide and peptide blasting against P. cultripes transcripts (xenx_pep_id, xenx_gene_symbol, xenx_description, xenp_pep_id, xenp_gene_symbol, xenp_description).

# The annotation file
xtrop<-read.csv("./xtr109/diamondblast109.csv", stringsAsFactors = FALSE)

Creating Gene Sets

However, this annotation file contains multiple annotations per gene. This means, in the columns corresponding to the results of BLASTing the genes, the fields (ID field or description field), we can find more than one element that is associated to one only gene.

g:Profiler is unable to process these entries, therefore we have to process the file so that it can interpret the multiple possible annotations for each gene.

The goal is to split multiple annotations within a single cell, unlist them to handle them individually and ensure uniqueness so that it is understandable for g:Profiler.

We will extract the X. tropicalis gene symbols as input for g:Profiler.

pull_genes<-function(x, alpha=0.05, lfc=0, signed="both", feature="xenp_gene_symbol") {
  
  # filter by adjusted p
  x<-x %>%
    filter(padj<alpha)
  
  # 3 modes for filtering based on log fold change, controlled by the signed parameter

  if(signed=="up"){
    x<-x %>%
      filter(log2FoldChange>lfc) %>%
      pull(feature)
  }  # include only genes where the log fold change is greater than a specified threshold (lfc, which defaults to 0)
  
  if(signed=="down"){
    x<-x %>%
      filter(log2FoldChange<(lfc*-1)) %>%
      pull(feature) 
  }  # include only genes where the log fold change is less than the negative of the specified threshold (-lfc)
  
  if(signed=="both"){
    x<-x %>%
      filter(abs(log2FoldChange)>lfc) %>%
      pull(feature) 
  } # include genes where the absolute value of the log fold change is greater than the specified threshold (lfc), includes changes in either direction.
  
  # deal with multiple gene annotations for the same gene (different transcripts)
  
  x<-strsplit(x, ";") %>% unlist() %>% unique()
  # separates different annotations in different rows
  x<-x[!is.na(x)]
  
  return(x[!is.na(x)])
}

# now extract all

sig_deg<-lapply(res, FUN=function(x) 
  x %>%
  as_tibble(rownames = "gene_id") %>%
  left_join(xtrop) %>%
    pull_genes(feature = "xenp_pep_id", alpha = 0.05, lfc=0, signed = "both")
)
## Joining with `by = join_by(gene_id)`
## Joining with `by = join_by(gene_id)`
str(sig_deg)
## List of 2
##  $ central: chr [1:106] "ENSXETP00000106396" "ENSXETP00000089706" "ENSXETP00000118428" "ENSXETP00000118788" ...
##  $ south  : chr [1:1065] "ENSXETP00000042154" "ENSXETP00000095923" "ENSXETP00000079038" "ENSXETP00000092115" ...

Make background

We will use the full set of genes that were returned by DESeq2. This set should have filtered out genes that have low counts.

We can use the same function from earlier to convert our list of Pelobates IDs to Xenopus peptide IDs.

# function to pull out xtr background IDs
extract_xtr<-function(x) {
  return(
      xtrop %>%
        filter(gene_id %in% x) %>%
        pull(xenp_pep_id) %>%
        str_split(pattern=";") %>%
        unlist() %>%
        na.omit() %>%
        unique()
  )
}

xtr_bg<-lapply(res, FUN= function(x) extract_xtr(rownames(x)))
str(xtr_bg)
## List of 2
##  $ central: chr [1:15183] "ENSXETP00000113033" "ENSXETP00000042154" "ENSXETP00000095923" "ENSXETP00000041999" ...
##  $ south  : chr [1:15183] "ENSXETP00000113033" "ENSXETP00000042154" "ENSXETP00000095923" "ENSXETP00000041999" ...
# lets unify the background to use the same across all populations
xtr_bg_all<-xtr_bg %>%
  unlist() %>%
  unique()

Functional Enrichment Analysis with g:Profiler

We will use g:Profiler tool, which is a web-based tool for functional enrichment analysis of gene lists.

The associated R package for g:Profiler is gprofiler2, which is an API (Application Programming Interface). This API serves as a bridge that allows performing the actual analysis within the g:Profiler server, by using the R environment in RStudio.

g:Profiler is continuously being updated to match the updates in the associated Ensembl and Ensembl Genomes Data Bases. Because of this, it is important to specify which version of g:Profiler, and therefore Ensembl version, we would like to use. We will do this by looking at the archives on the g:Profiler homepage and setting the base url for the desired version.

Our annotations resulted from BLASTing P. cultripes transcriptome against the 109 Ensembl release of the X. tropicalis proteome, therefore we will use Ensembl 109, Ensembl Genomes 56 (database built on 2023-03-29).

# Install and load the gprofiler2 package. This package provides functions to access the gprofiler API directly from R. 

# install.packages("gprofiler2")
library(gprofiler2)

# Setting the base url

# set base url:
set_base_url("https://biit.cs.ut.ee/gprofiler_archive3/e109_eg56_p17/")

# run the analysis
res_ora<-gost(multi_query = FALSE, # returns separate results tables for multiquery
              custom_bg = xtr_bg_all, # our background
              query=sig_deg, # our list of gene sets
              organism="xtropicalis", # the organism our annotations belong to
              exclude_iea = FALSE, # include GO terms that were electronically assigned
              correction_method = "gSCS", # the recommended multiple testing correction.
              sources=c("GO:BP","GO:CC","GO:MF", "KEGG","REAC"), # the functional sets we are interested in 
              evcodes=FALSE, # evcodes TRUE needed for downstream analysis like enrichment maps in Cytoscape, but it takes longer
              significant= FALSE) # return all terms, not just the significant ones
## Detected custom background input, domain scope is set to 'custom'.
# The results are stored as a "results" dataframe 
head(res_ora$result)
##     query significant     p_value term_size query_size intersection_size
## 1 central        TRUE 0.001855428      4558        100                60
## 2 central        TRUE 0.002466472      3863        100                54
## 3 central        TRUE 0.002476485      4221        100                57
## 4 central       FALSE 0.373583457      1244        100                24
## 5 central       FALSE 0.450705336      8450        100                82
## 6 central       FALSE 0.675368296      1200        100                23
##   precision      recall    term_id source                           term_name
## 1      0.60 0.013163668 GO:0065007  GO:BP               biological regulation
## 2      0.54 0.013978773 GO:0050794  GO:BP      regulation of cellular process
## 3      0.57 0.013503909 GO:0050789  GO:BP    regulation of biological process
## 4      0.24 0.019292605 GO:0009889  GO:BP  regulation of biosynthetic process
## 5      0.82 0.009704142 GO:0009987  GO:BP                    cellular process
## 6      0.23 0.019166667 GO:0051252  GO:BP regulation of RNA metabolic process
##   effective_domain_size source_order      parents
## 1                 13686        16811   GO:0008150
## 2                 13686        14008 GO:00099....
## 3                 13686        14004 GO:00081....
## 4                 13686         3827 GO:00090....
## 5                 13686         3889   GO:0008150
## 6                 13686        14368 GO:00160....

Let’s look at the enrichment results in detail

colnames(res_ora$result)
##  [1] "query"                 "significant"           "p_value"              
##  [4] "term_size"             "query_size"            "intersection_size"    
##  [7] "precision"             "recall"                "term_id"              
## [10] "source"                "term_name"             "effective_domain_size"
## [13] "source_order"          "parents"
  • query: This column represents the gene set or query submitted to g:Profiler for enrichment analysis. Each row corresponds to a different gene set or query.

  • significant: This column indicates whether the enrichment analysis identified the term (functional category) as statistically significant. It contains a boolean value (TRUE or FALSE).

  • p_value: This column contains the adjusted p-values. Each p-value indicates the statistical significance of enrichment of a given term in the gene set.

  • term_size: This column represents the total number of genes in a specific term (functional category).

  • query_size: This column represents the total number of genes in the submitted query or gene set.

  • intersection_size: This column represents the number of genes that are shared between the submitted query and the term being analyzed. That is, how many genes from both term size and query size are overlapping.

  • term_id: This column contains the unique identifier (ID) associated with the term (functional category) being analyzed.

  • term_name: This column contains the name or description of the term (functional category) being analyzed.

  • source: This column indicates the data source or database from which the term originates and that we requested (Gene Ontology, KEGG, Reactome).

  • precision: This column represents the proportion of the genes shared between the submitted query and the term being analyzed, relative to the total number of genes in the query. It measures the accuracy of the enrichment analysis. A higher precision value indicates that a larger proportion of the genes in the intersection are relevant to the term being analyzed, relative to the total number of genes in the query. In other words, it reflects how well the term represents the genes in the query.

  • recall: This column represents the proportion of the genes in the intersection, the same as the precision column, but relative to the total number of genes in the term. It reflects how well the term represents the genes in the dataset or background.

  • effective_domain_size: This column represents the effective domain size used in the enrichment analysis. It is a statistical parameter used to calculate the p-value.

  • source_order: This column represents the order or rank of the source (data database) based on its importance or relevance in the enrichment analysis.

  • parents: This column contains information about the parent terms or higher-level categories to which the analyzed term belongs. It may provide additional hierarchical context for the analyzed terms.

Analyzing the results

We can use a p-value cutoff of 0.05 to see how many terms have been functionally enriched in each gene set.

res_ora$result %>%
  filter(p_value<0.05) %>%
  group_by(query) %>%
  dplyr::count(query, sort=TRUE)
## # A tibble: 2 × 2
## # Groups:   query [2]
##   query       n
##   <chr>   <int>
## 1 south      12
## 2 central     9
res_ora$result %>%
  filter(p_value<0.05) %>%
  group_by(query)
## # A tibble: 21 × 14
## # Groups:   query [2]
##    query   significant  p_value term_size query_size intersection_size precision
##    <chr>   <lgl>          <dbl>     <int>      <int>             <int>     <dbl>
##  1 central TRUE         1.86e-3      4558        100                60    0.6   
##  2 central TRUE         2.47e-3      3863        100                54    0.54  
##  3 central TRUE         2.48e-3      4221        100                57    0.57  
##  4 central TRUE         6.62e-5        79        100                 7    0.07  
##  5 central TRUE         4.16e-4       242        100                10    0.1   
##  6 central TRUE         9.11e-3       119        100                 6    0.06  
##  7 central TRUE         3.51e-2        59        100                 4    0.04  
##  8 central TRUE         3.98e-2        61        100                 4    0.04  
##  9 central TRUE         4.91e-2       110        100                 5    0.05  
## 10 south   TRUE         4.82e-2       818        985                96    0.0975
## # ℹ 11 more rows
## # ℹ 7 more variables: recall <dbl>, term_id <chr>, source <chr>,
## #   term_name <chr>, effective_domain_size <int>, source_order <int>,
## #   parents <list>

GOST plot

gostplot(res_ora)

Dotplot

Custom dot plot using the gprofiler results tables and ggplot.

custom_query_names <- res_ora$result %>%
  mutate(query = case_when(
    query == "central" ~ "Less plastic",
    query == "south" ~ "Plastic",
    TRUE ~ query
  ))

custom_query_names %>%
  select(query,term_name, p_value, intersection_size, query_size,source) %>%
  filter(p_value<0.05) %>%
  mutate(GeneRatio=intersection_size/query_size) %>%
  arrange(GeneRatio) %>%
  mutate(term_name = factor(term_name, levels=unique(term_name))) %>%
  ggplot(aes(x=GeneRatio, y=term_name)) +
  geom_point(aes(color=p_value, size=intersection_size)) +
  ylab("") +
  # scale_color_scico(palette = "bamako", direction = 1) +
  facet_grid(source~query,scales = "free_y",space = "free") +
  theme_bw()

ggsave("functional_enrich.jpeg", width = 8, height = 8.5, dpi = 600)